Additional File 2 — Use Case: Investigate Input Features for Carcinogenicity Models
Abstract
Additional file for the article "CheS-Mapper 2.0 for Visual Validation of (Q)SAR models". We visually validate the effect of exchanging the descriptors used by a (Q)SAR algorithm. To this end, we select a subset of the Carcinogenic Potency Database (CPDB) [1], which covers various species. The database contains 86 compounds with an assigned activity value (active or inactive) for hamster carcinogenicity. We compute two different sets of features for these compounds with CheS-Mapper: 308 physico-chemical (PC) descriptors using CDK and Open Babel, and 287 structural fragments. The structural features were calculated by matching the compounds against three predefined SMARTS lists included in Open Babel.
We choose the random forest implementation from the WEKA workbench as the classification algorithm and compare two approaches: (Q)SAR-1 is built using only the physico-chemical descriptors, while (Q)SAR-2 uses a combination of both feature sets. We validate both variants with a 5-times repeated 10-fold cross-validation. (Q)SAR-2 achieved a classification accuracy of 0.75 and significantly outperformed (Q)SAR-1, which had a classification accuracy of only 0.67. Apparently, using both feature types yields a more predictive model.
As CheS-Mapper's 3D embedding is based on the features, we start the program twice to compare (simultaneously) the effect of using different feature sets. When highlighting the actual endpoint value, we note that the compounds are roughly separated according to their class value. The separation (and thus the decision boundary) is less distinct when using only PC features (Figure 1) than when structural fragments are added (Figure 2). This indicates that it is easier for (Q)SAR-2 to predict the endpoint than for (Q)SAR-1. Comparing the misclassifications of both approaches, we detect two compounds that have always been correctly classified by (Q)SAR-2, but not by (Q)SAR-1.
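The validation scheme described above (a 5-times repeated 10-fold cross-validation, followed by a per-compound comparison of misclassifications) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the workflow used in the study: the study used WEKA's random forest, whereas the sketch uses a 1-nearest-neighbour stand-in to stay within the Python standard library; the folds are not stratified; the data are synthetic toy values, not CPDB compounds; and all function names (`repeated_kfold_indices`, `nearest_neighbour_predict`, `misclassification_counts`) are hypothetical.

```python
import random
from collections import Counter


def repeated_kfold_indices(n, k=10, repeats=5, seed=0):
    """Yield (repetition, test_indices); each repetition reshuffles and splits into k folds."""
    rng = random.Random(seed)
    for rep in range(repeats):
        order = list(range(n))
        rng.shuffle(order)
        for fold in range(k):
            # Every k-th element starting at `fold`: the k slices partition `order`.
            yield rep, order[fold::k]


def nearest_neighbour_predict(train_x, train_y, x):
    """Stand-in classifier (1-NN, squared Euclidean distance); NOT a random forest."""
    best = min(range(len(train_x)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)))
    return train_y[best]


def misclassification_counts(features, labels, k=10, repeats=5, seed=0):
    """Return (accuracy, Counter mapping compound index -> repetitions in which it was misclassified)."""
    n = len(labels)
    wrong = Counter()
    for _rep, test_idx in repeated_kfold_indices(n, k, repeats, seed):
        test = set(test_idx)
        train = [i for i in range(n) if i not in test]
        train_x = [features[i] for i in train]
        train_y = [labels[i] for i in train]
        for i in test_idx:
            if nearest_neighbour_predict(train_x, train_y, features[i]) != labels[i]:
                wrong[i] += 1
    # Each compound is predicted exactly once per repetition.
    accuracy = 1 - sum(wrong.values()) / (n * repeats)
    return accuracy, wrong


# Toy data (hypothetical, not CPDB): 20 compounds with alternating class labels.
toy_rng = random.Random(1)
labels = [i % 2 for i in range(20)]
pc_only = [[toy_rng.random()] for _ in labels]                   # uninformative descriptor
combined = [[x[0], float(y)] for x, y in zip(pc_only, labels)]   # plus an informative "fragment"

acc1, wrong1 = misclassification_counts(pc_only, labels)    # analogue of (Q)SAR-1
acc2, wrong2 = misclassification_counts(combined, labels)   # analogue of (Q)SAR-2

# Compounds misclassified at least once with the first feature set but never with the second.
rescued = sorted(i for i in wrong1 if wrong2[i] == 0)
print(f"accuracy feature set 1: {acc1:.2f}, feature set 2: {acc2:.2f}")
print("compounds rescued by the second feature set:", rescued)
```

The counter returned per feature set corresponds to statements such as "misclassified as active 2 out of 5 times": a value of 2 for a compound means it was predicted incorrectly in two of the five repetitions.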
The inactive compound isonicotinic acid (DSSTox-RID 20757) is selected in Figure 2 (marked with a label and drawn as a 2D picture at the top right-hand side). In the embedding based on both feature types, it is located in an entirely inactive region of the feature space. It is correctly classified by (Q)SAR-2 in 5 of 5 repetitions of the cross-validation. In contrast, (Q)SAR-1 misclassified this compound as active in 2 of 5 repetitions. As previously described, the feature list at the top right-hand side is sorted according to specificity. Hence, carboxylic acid is the structural feature that most distinguishes this compound from the remaining compounds in the dataset. With the help of CheS-Mapper, we …
Publication date: 2014